-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SCHEMA] Add TSV column files #827
Conversation
can I ask why using YAML? for many people BIDS is scary just because we have tsv and json ... this adds yet another layer -- I think we need to be careful |
The main reason we chose yaml over json is that yaml allows comments. Also, the schema files describing the specification aren't things that BIDS users will interact with, any more than they spend time with the |
ah I see - ok cool thx, just wanted to know more |
Made a mental note yesterday though that I will need to double check how this splitting of content in several files and format might affect "on-boarding" of people who want to contribute to specifications (especially for BEP content but also in general).
Will try to remember to raise the issue at the next maintainers meeting I attend.
…Sent from my Galaxy
|
Note that there are three very different uses of "name" columns, and two of them are equally common, so I chose not to specify any of them as the "canonical" definition.
Normally (speaking from experience), a technical schema is somehow self-contained. Indeed, the many yaml files may scare off contributors, but, as Taylor mentioned, they are rather intended for automatic processing via tooling |
If the intent is to provide a machine readable method, multiple files and directories on github is more difficult. It's pretty simple to download raw files, but the easiest way to manage the many schema files is to clone the whole repo. It's not an unmanageable obstacle, just a bit obnoxious. |
I will consolidate the column files into a single file if #883 is merged. I'm planning to hold off on changing the structure in this PR until that PR is handled. |
@tsalo Apologies for taking so long. Which PR should be considered higher priority? |
This PR contains content that needs to be reviewed, while the other really just needs a quick once-over from anyone who uses the schema. The changes in #883 include updates to the internal schema-processing code, so nothing in the specification should really change there. In summary, I think that #883 is higher priority, but this PR will need a more thorough review. |
@effigies just a small note to follow up from our call earlier this week. It looks like the JSON fields used to describe TSV columns (e.g., "LongName", "Description", "Levels") are included in the metadata YAML files, so they won't need to be added here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you are planning to add more to this but so far this looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I keep starting and having to drop this. Here are some comments in my queue before I start a new review.
- cell line | ||
- in vitro differentiated cells | ||
- primary cell | ||
- cell-free sample | ||
- cloning host | ||
- tissue | ||
- whole organisms | ||
- organoid | ||
- technical sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we automatically turn this into the following?
One of
"cell line"
,"in vitro differentiated cells"
, [...]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think automatically incorporating value restrictions like enums would be awesome. I'd prefer to tackle it in a separate PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #912 about this.
Hexadecimal. Label color for visualization. | ||
type: string | ||
unit: hexadecimal | ||
derived_from: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the column header "derived_from" seems fine, perhaps we should have a more descriptive term name:
derived_from: | |
sample_derived_from: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current approach to multiply-defined objects is to start with the base name and only create a new definition if a new use for that object is proposed with a definition that is incompatible with the original. I'd prefer not to link objects to specific datatypes or sections until it becomes necessary, but others may feel differently.
When we do have multiple definitions of the same object, I want to make sure that the names appear together and that it's clear where the difference from the "name" comes from, so I would lean toward something like derived_from__sample
. I added a section to the schema README about duplicate terms with different definitions, so hopefully that will be more clear now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the last remaining blocker, and what we decide here should apply across the schema. Here is what I added to the schema README. @effigies @Remi-Gau @sappelhoff WDYT?
If an object may mean something different depending on where it is used within the specification,
then this must be reflected in the schema.
Specifically, each version of the object must have its own definition within the relevant file.
However, since object files are organized as dictionaries, each object must have a unique key.
Thus, we append a suffix to each re-used object's key in order to make it unique.
For objects withCamelCase
names (for example, metadata fields), the suffix will start with a single underscore (_
).
For objects withsnake_case
names, two underscores must be used.There should also be a comment near the object definition in the YAML file describing the nature of the different objects.
For example, the TSV column
"reference"
means different things when used for EEG data, as compared to iEEG data.
As such, there are two definitions incolumns.yaml
for the"reference"
column:"reference__eeg"
and"reference_ieeg"
.# reference column for channels.tsv files for EEG data reference__eeg: name: reference description: | Name of the reference electrode(s). This column is not needed when it is common to all channels. In that case the reference electrode(s) can be specified in `*_eeg.json` as `EEGReference`). type: string # reference column for channels.tsv files for iEEG data reference__ieeg: name: reference description: | Specification of the reference (for example, 'mastoid', 'ElectrodeName01', 'intracranial', 'CAR', 'other', 'n/a'). If the channel is not an electrode channel (for example, a microphone channel) use `n/a`. anyOf: - type: string - type: string enum: - n/aWhen adding new object definitions to the schema,
every effort should be made to find a shared, common definition for the term, should it already exist.
If the differences between two versions of the same object are subtle or driven by context,
then you can generally append additional text to the object definition within the associated rendered table in the specification,
rather than creating a separate entry in the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a proposal for including common text in both samples and sessions.
of the strain of the species, for example: `RRID:IMSR_JAX:000664`. | ||
type: string | ||
format: rrid | ||
time: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea to give this a specific name like blood_sample_time
.
Since I think @sappelhoff covered *EG, I read all rendered pages except that. I have not gone through the actual schema file in detail or the Python (at least recently). Please let me know if you'd like me to. |
Co-authored-by: Chris Markiewicz <[email protected]>
Co-authored-by: Chris Markiewicz <[email protected]>
…ecification into tsv-columns-schema
{{ MACROS___make_columns_table( | ||
{ | ||
"participant_id": ("REQUIRED", "There MUST be exactly one row for each participant."), | ||
"species": "RECOMMENDED", | ||
"age": "RECOMMENDED", | ||
"sex": "RECOMMENDED", | ||
"handedness": "RECOMMENDED", | ||
"strain": "RECOMMENDED", | ||
"strain_rrid": "RECOMMENDED", | ||
} | ||
) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it accurate to say that this macro represents a BIDS rule for the `participants.tsv" file? If so, should this be described in the schema somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the tables do represent rules, but those rules are too complicated to be translated directly from the tables. For more information on the types of mechanisms we need to support in the schema to accurately codify the specification's rules, please see #620.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with this, thanks Taylor.
Closes None.
Tagging @kayrobbins so she can take a look.
Changes proposed:
To do: